We investigate training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture which jointly learns acoustic and language model components from transcribed acoustic data. We explore various model architectures and demonstrate how the model can be improved further if additional text or pronunciation data are available. The model consists of an `encoder', which is initialized from a connectionist temporal classification-based (CTC) acoustic model, and a `decoder', which is partially initialized from a recurrent neural network language model trained on text data alone. The entire neural network is trained with the RNN-T loss and directly outputs the recognized transcript as a sequence of graphemes, thus performing end-to-end speech recognition. We find that performance can be improved further through the use of sub-word units (`wordpieces'), which capture longer context and significantly reduce substitution errors. The best RNN-T system, a twelve-layer LSTM encoder with a two-layer LSTM decoder trained with 30,000 wordpieces as output targets, achieves a word error rate of 8.5% on voice-search and 5.2% on voice-dictation tasks, and is comparable to a state-of-the-art baseline at 8.3% on voice-search and 5.4% on voice-dictation.
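The encoder/decoder/joint structure the abstract describes can be summarized in a minimal sketch. The PyTorch module below is an illustrative assumption, not the authors' implementation: the layer widths, joint-network size, and all parameter names are stand-ins; only the overall structure (an acoustic `encoder', a label-history `decoder', and a joint network over the time-by-label lattice) and the twelve-layer/two-layer LSTM depths follow the text.

```python
# Minimal sketch of an RNN-T model, assuming illustrative layer sizes.
# In the paper, the encoder is initialized from a CTC acoustic model and
# the decoder from an RNN language model; those steps are omitted here.
import torch
import torch.nn as nn

class RNNTransducer(nn.Module):
    def __init__(self, num_labels, feat_dim=80, enc_layers=12,
                 dec_layers=2, hidden=640, joint_dim=512):
        super().__init__()
        # Encoder: plays the role of the acoustic model
        # (twelve LSTM layers in the paper's best system).
        self.encoder = nn.LSTM(feat_dim, hidden, enc_layers, batch_first=True)
        # Decoder (prediction network): plays the role of a language model
        # over the output units (graphemes or wordpieces); +1 for blank.
        self.embed = nn.Embedding(num_labels + 1, hidden)
        self.decoder = nn.LSTM(hidden, hidden, dec_layers, batch_first=True)
        # Joint network: combines encoder and decoder states and predicts
        # a distribution over labels plus blank at each (t, u) grid point.
        self.joint = nn.Sequential(
            nn.Linear(2 * hidden, joint_dim), nn.Tanh(),
            nn.Linear(joint_dim, num_labels + 1))

    def forward(self, feats, labels):
        enc, _ = self.encoder(feats)               # (B, T, H)
        dec, _ = self.decoder(self.embed(labels))  # (B, U, H)
        # Broadcast both over the (T, U) lattice the RNN-T loss sums over.
        t = enc.unsqueeze(2).expand(-1, -1, dec.size(1), -1)
        u = dec.unsqueeze(1).expand(-1, enc.size(1), -1, -1)
        return self.joint(torch.cat([t, u], dim=-1))  # (B, T, U, L+1)
```

The resulting (B, T, U, L+1) logits can then be trained with an RNN-T loss implementation (for example, `torchaudio.functional.rnnt_loss`), which marginalizes over all monotonic alignments of the label sequence to the acoustic frames.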